training prompt
- Asia > China > Shaanxi Province > Xi'an (0.04)
- Asia > Middle East > Jordan (0.04)
- Asia > China > Guangdong Province > Shenzhen (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)
- North America > United States (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Learning to Reason for Factuality
Chen, Xilun, Kulikov, Ilia, Berges, Vincent-Pierre, Oğuz, Barlas, Shao, Rulin, Ghosh, Gargi, Weston, Jason, Yih, Wen-tau
Reasoning Large Language Models (R-LLMs) have significantly advanced complex reasoning tasks but often struggle with factuality, generating substantially more hallucinations than their non-reasoning counterparts on long-form factuality benchmarks. However, extending online Reinforcement Learning (RL), a key component in recent R-LLM advancements, to the long-form factuality setting poses several unique challenges due to the lack of reliable verification methods. Previous work has utilized automatic factuality evaluation frameworks such as FActScore to curate preference data in the offline RL setting, yet we find that directly leveraging such methods as the reward in online RL leads to reward hacking in multiple ways, such as producing less detailed or relevant responses. We propose a novel reward function that simultaneously considers the factual precision, response detail level, and answer relevance, and applies online RL to learn high quality factual reasoning. Evaluated on six long-form factuality benchmarks, our factual reasoning model achieves an average reduction of 23.1 percentage points in hallucination rate, a 23% increase in answer detail level, and no degradation in the overall response helpfulness.
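The reward described above combines three signals rather than factual precision alone. Below is a minimal sketch of such a composite reward; the weights, the claim-count normalizer, and the scorer interfaces are illustrative assumptions, not the paper's actual formulation.

```python
# Minimal sketch of a composite factuality reward in the spirit of the abstract.
# The weights, normalization, and helper scorers are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FactualityReward:
    score_precision: Callable[[str], float]       # fraction of extracted claims judged supported (0..1)
    count_claims: Callable[[str], int]            # number of verifiable claims (detail level)
    score_relevance: Callable[[str, str], float]  # prompt/response relevance (0..1)
    claim_norm: float = 30.0                      # claim count treated as "fully detailed"
    w_precision: float = 1.0
    w_detail: float = 0.5
    w_relevance: float = 0.5

    def __call__(self, prompt: str, response: str) -> float:
        precision = self.score_precision(response)
        detail = min(self.count_claims(response) / self.claim_norm, 1.0)
        relevance = self.score_relevance(prompt, response)
        # Combining all three terms discourages the degenerate policies the
        # abstract warns about: short but "safe" answers (low detail) and
        # off-topic padding (low relevance) no longer maximize reward.
        return (self.w_precision * precision
                + self.w_detail * detail
                + self.w_relevance * relevance)
```

In an online RL loop, a scalar like this would stand in for a raw FActScore-style precision reward, so that producing fewer or less relevant claims no longer trivially maximizes the score.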
- Europe > Austria > Vienna (0.14)
- Europe > France (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (5 more...)
- Law Enforcement & Public Safety > Crime Prevention & Enforcement (1.00)
- Government > Immigration & Customs (0.93)
- Law > Criminal Law (0.93)
The Devil is in the Prompts: Retrieval-Augmented Prompt Optimization for Text-to-Video Generation
Gao, Bingjie, Gao, Xinyu, Wu, Xiaoxue, Zhou, Yujie, Qiao, Yu, Niu, Li, Chen, Xinyuan, Wang, Yaohui
The evolution of Text-to-video (T2V) generative models, trained on large-scale datasets, has been marked by significant progress. However, the sensitivity of T2V generative models to input prompts highlights the critical role of prompt design in influencing generative outcomes. Prior research has predominantly relied on Large Language Models (LLMs) to align user-provided prompts with the distribution of training prompts, albeit without tailored guidance encompassing prompt vocabulary and sentence structure nuances. To this end, we introduce RAPO, a novel Retrieval-Augmented Prompt Optimization framework. To address potential inaccuracies and ambiguous details in LLM-generated prompts, RAPO refines naive prompts through dual optimization branches and selects the superior prompt for T2V generation. The first branch augments user prompts with diverse modifiers extracted from a learned relational graph, refining them to align with the format of training prompts via a fine-tuned LLM. Conversely, the second branch rewrites the naive prompt using a pre-trained LLM following a well-defined instruction set. Extensive experiments demonstrate that RAPO can effectively enhance both the static and dynamic dimensions of generated videos, demonstrating the significance of prompt optimization for user-provided prompts.
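The dual-branch procedure can be sketched roughly as follows; the retrieval graph, the two LLM calls, and the selection step are placeholders, since the abstract does not specify their implementations.

```python
# Illustrative control flow for a RAPO-style dual-branch prompt optimizer.
# The retrieval graph, both LLM calls, and the selector are stand-in callables.
from typing import Callable, List

def optimize_prompt(
    user_prompt: str,
    retrieve_modifiers: Callable[[str], List[str]],    # modifiers from a learned relational graph
    refine_with_finetuned_llm: Callable[[str], str],   # branch 1: align with training-prompt format
    rewrite_with_pretrained_llm: Callable[[str], str], # branch 2: instruction-guided rewrite
    prefer: Callable[[str, str], str],                 # pick the prompt expected to yield the better video
) -> str:
    # Branch 1: augment with retrieved modifiers, then refine toward the training-prompt format.
    augmented = user_prompt + ", " + ", ".join(retrieve_modifiers(user_prompt))
    branch_1 = refine_with_finetuned_llm(augmented)
    # Branch 2: rewrite the naive prompt directly under a fixed instruction set.
    branch_2 = rewrite_with_pretrained_llm(user_prompt)
    # Select the superior candidate and hand it to the T2V model.
    return prefer(branch_1, branch_2)
```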
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Shen, Wei, Liu, Guanlin, Wu, Zheng, Zhu, Ruofei, Yang, Qingping, Xin, Chao, Yue, Yu, Yan, Lin
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
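A rough sketch of the hybrid reward routing is shown below: verifiable reasoning prompts are scored by a rule-based verifier (RTV) and all other prompts fall back to a generative reward model (GenRM). The routing rule and function signatures are assumptions, not the paper's code.

```python
# Hedged sketch of a hybrid reward router: tasks with a checkable reference get
# a verifier-based reward (RTV); open-ended prompts fall back to a generative
# reward model (GenRM). Names and signatures are illustrative assumptions.
from typing import Callable, Optional

def hybrid_reward(
    prompt: str,
    response: str,
    reference: Optional[str],                   # ground-truth answer if the task has one
    verify: Callable[[str, str], bool],         # RTV: checks the response against the reference
    genrm_score: Callable[[str, str], float],   # GenRM: scalar preference score
) -> float:
    if reference is not None:
        # Verifier-based reward: the most resistant to reward hacking, per the paper's findings.
        return 1.0 if verify(response, reference) else 0.0
    # Generative reward model as the fallback for open-ended prompts.
    return genrm_score(prompt, response)
```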
- Law (1.00)
- Education (1.00)
- Water & Waste Management > Solid Waste Management (0.94)
PAFT: Prompt-Agnostic Fine-Tuning
Wei, Chenxing, Shu, Yao, Ou, Mingwen, He, Ying Tiffany, Yu, Fei Richard
While Large Language Models (LLMs) adapt well to downstream tasks after fine-tuning, this adaptability often compromises prompt robustness, as even minor prompt variations can significantly degrade performance. To address this, we propose Prompt-Agnostic Fine-Tuning (PAFT), a simple yet effective approach that dynamically adjusts prompts during fine-tuning. This encourages the model to learn underlying task principles rather than overfitting to specific prompt formulations. PAFT operates in two stages: First, a diverse set of meaningful, synthetic candidate prompts is constructed. Second, during fine-tuning, prompts are randomly sampled from this set to create dynamic training inputs. Extensive experiments across diverse datasets and LLMs demonstrate that models trained with PAFT exhibit strong robustness and generalization across a wide range of prompts, including unseen ones. This enhanced robustness improves both model performance and inference speed while maintaining training efficiency. Ablation studies further confirm the effectiveness of PAFT.
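At training time, the two-stage recipe reduces to sampling a fresh prompt template for every example. A minimal sketch under assumed data and template formats:

```python
# Minimal sketch of PAFT-style dynamic prompt sampling during fine-tuning.
# The candidate prompt pool and the record layout are illustrative assumptions;
# PAFT's actual synthetic prompt-construction stage is not detailed in the abstract.
import random
from typing import Dict, List

CANDIDATE_PROMPTS: List[str] = [
    "Answer the following question: {question}",
    "Question: {question}\nRespond concisely.",
    "You are a helpful assistant. {question}",
]

def build_training_example(sample: Dict[str, str]) -> Dict[str, str]:
    # Stage 2: a template is drawn at random for every example, so the model
    # cannot overfit to a single fixed prompt formulation (Stage 1 would have
    # produced CANDIDATE_PROMPTS synthetically).
    template = random.choice(CANDIDATE_PROMPTS)
    return {
        "input": template.format(question=sample["question"]),
        "target": sample["answer"],
    }
```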
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (7 more...)
- Research Report (1.00)
- Overview (0.67)
R.I.P.: Better Models by Survival of the Fittest Prompts
Yu, Ping, Yuan, Weizhe, Golovneva, Olga, Wu, Tianhao, Sukhbaatar, Sainbayar, Weston, Jason, Xu, Jing
Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high-variance, low-quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets or to build high-quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, moving the model from 18th to 6th place overall on the leaderboard.
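A hedged sketch of the filtering rule implied by the abstract, keeping only prompts whose rejected response is not too poor and whose chosen-rejected reward gap is not too large; the thresholds and record layout are illustrative assumptions, not the paper's exact criteria.

```python
# Sketch of RIP-style prompt filtering from the two signals the abstract names:
# the rejected response's quality and the chosen-vs-rejected reward gap.
from typing import Dict, List

def rip_filter(
    preference_pairs: List[Dict[str, float]],  # each: {"reward_chosen": ..., "reward_rejected": ...}
    min_rejected_reward: float = 0.3,          # drop prompts whose rejected response is very poor
    max_reward_gap: float = 0.5,               # drop prompts with an unusually large reward gap
) -> List[Dict[str, float]]:
    kept = []
    for pair in preference_pairs:
        gap = pair["reward_chosen"] - pair["reward_rejected"]
        # A very low rejected reward and a large gap both signal a high-variance,
        # low-quality prompt under RIP's assumption; such prompts are removed.
        if pair["reward_rejected"] >= min_rejected_reward and gap <= max_reward_gap:
            kept.append(pair)
    return kept
```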
- North America > Canada > Ontario > Toronto (0.04)
- North America > United States > New York (0.04)
- North America > United States > California (0.04)
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
Ivison, Hamish, Wang, Yizhong, Liu, Jiacheng, Wu, Zeqiu, Pyatkin, Valentina, Lambert, Nathan, Smith, Noah A., Choi, Yejin, Hajishirzi, Hannaneh
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning (preference data, learning algorithm, reward model, and policy training prompts), systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning from preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories.
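For reference, the DPO objective that the paper compares against PPO can be written as the standard loss below; the log-probability inputs and the beta value are generic, not the paper's specific configuration.

```python
# Standard DPO loss over a batch of chosen/rejected preference pairs.
# Inputs are summed log-probabilities of each completion under the policy
# and under the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x), shape [batch]
    ref_chosen_logps: torch.Tensor,       # same quantities under the reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Push the chosen completion's implicit reward above the rejected one's.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```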
- North America > United States (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Asia > Middle East > Jordan (0.04)